Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate make_scalar_function #8878

Merged
merged 11 commits into from
Jan 22, 2024
Merged

Conversation

viirya
Copy link
Member

@viirya viirya commented Jan 15, 2024

Which issue does this PR close?

Closes #8866.

Rationale for this change

make_scalar_function is easily to be misused and causing unexpected behavior (e.g., #8866) if users don't understand its underlying logic. Actually this function looks more like a crate private utility function we should not expose.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jan 15, 2024
pub fn make_scalar_function<F>(inner: F) -> ScalarFunctionImplementation
/// Note that this function makes a scalar function with no arguments or all scalar inputs return a scalar.
/// That's said its output will be same for all input rows in a batch.
pub(crate) fn make_scalar_function<F>(inner: F) -> ScalarFunctionImplementation
Copy link
Member Author

@viirya viirya Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we should deprecate it and ask user to use ScalarUDFImpl instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a behavior change, so I add a label for it

@jackwener jackwener added the api change Changes the API exposed to users of the crate label Jan 16, 2024
jackwener
jackwener previously approved these changes Jan 16, 2024
Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me, thanks @viirya

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya -- I now understand the issue in #8866 better.

I think make_scalar_function is quite a nice function that lowered the cognative barrier to making udfs and it would be great if we could keep it. I think @jorgecarleitao or @andygrove originally added it when we introduced ColumnarValue so that new users didn't have to understand ColumnarValue to begin working on it.

That being said, perhaps encouraging people to use the trait ScalarUDFImpl is ok,

datafusion/physical-expr/src/functions.rs Outdated Show resolved Hide resolved
let ColumnarValue::Array(array) = &args[0] else {
panic!()
};
Ok(ColumnarValue::Array(Arc::new(array.clone()) as ArrayRef))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could make this code easier to work with using From impls, like

Suggested change
Ok(ColumnarValue::Array(Arc::new(array.clone()) as ArrayRef))
Ok(ColumnarValue::from(Arc::new(array.clone()) as ArrayRef))

Maybe it doesn't matter 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I added two From impls for ColumnarValue.

// in DataFusion, all `args` and output are dynamically-typed arrays, which means that we need to:
// 1. cast the values to the type we want
// 2. perform the computation for every element in the array (using a loop or SIMD) and construct the result

// this is guaranteed by DataFusion based on the function's signature.
assert_eq!(args.len(), 2);

let ColumnarValue::Array(arg0) = &args[0] else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the more correct pattern is to create an array with into_array

Suggested change
let ColumnarValue::Array(arg0) = &args[0] else {
let arg0 = args[0].into_array(num_rows)

Though now I wrote that it is not super clear to me where the num_rows should come from 🤔

Also, maybe this is more of the fix for #8866 -- that for a volatile function, it shouldn't be passed a scalar but unstead the array should be expanded prior to invocation

Copy link
Member

@jackwener jackwener Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situations would arg of pow be a Scalar? If we can provide an example to reproduce it, our case/test can cover more condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something like pow(1.0, 2.0) (pass in literal values)

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jan 17, 2024
@viirya viirya changed the title Make make_scalar_function private Deprecate make_scalar_function Jan 17, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya --- I think it would be nice to avoid some of the repetition of converting [ColumnarValue] into [ArrayRef] by encapsulating it into a function but I also think we can do so as a follow on PR.

I also think we might want to consieer wait for DataFusion 35.0.0 to be released #8863 before merging this PR

let arg0 = as_int32_array(&args[0])?;
let arg1 = as_int32_array(&args[1])?;
let fun = Arc::new(|args: &[ColumnarValue]| {
let len = args
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe as a follow on PR we can make a function that does this length inference and conversion to array (mostly so it can be documented) to make the ideas easier to find 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, have the same thought too. It will be still useful for ScalarUDFImpl as invoke takes args: &[ColumnarValue] and users might need length inference too.

@viirya
Copy link
Member Author

viirya commented Jan 19, 2024

I also think we might want to consieer wait for DataFusion 35.0.0 to be released #8863 before merging this PR

Thanks @alamb. This currently marks make_scalar_function as deprecated since 35.0.0 (#[deprecated(since = "35.0.0"...), if we want to merge this after DataFusion 35.0.0 releases, should we change the deprecation message to since = "36.0.0")?

@alamb
Copy link
Contributor

alamb commented Jan 22, 2024

I am going to merge this PR on the assumption that the 35.0.0 release is in the final stages of voting and we won't need a new RC.

Thanks again @viirya

@alamb
Copy link
Contributor

alamb commented Jan 22, 2024

I took the liberty of merging up from main just to ensure there are no logical conflicts and plan to merge this PR when the CI checks pass

@viirya
Copy link
Member Author

viirya commented Jan 22, 2024

Thank you @alamb @jackwener

@alamb alamb merged commit 903ef94 into apache:main Jan 22, 2024
22 checks passed
@dadepo
Copy link

dadepo commented Jan 23, 2024

The documentation here https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html probably needs updating too, as it still suggest to use make_scalar_function

@viirya
Copy link
Member Author

viirya commented Jan 23, 2024

Thanks. I will update the document together in a follow up with other stuffs (e.g., #8878 (comment)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UDFs marked as volatile do not appear to evaluate multiple times for each output row.
4 participants